Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge
Spiliopoulou, Evangelia, Fogliato, Riccardo, Burnsky, Hanna, Soliman, Tamer, Ma, Jie, Horwood, Graham, Ballesteros, Miguel
Large language models (LLMs) can serve as judges that offer rapid and reliable assessments of other LLM outputs. However, models may systematically assign overly favorable ratings to their own outputs, a phenomenon known as self-bias, which can distort evaluations of true model performance. Previous studies often conflate genuine differences in model quality with bias, or incorrectly assume that evaluations from LLMs and humans follow the same rating distributions. In this work, we present a statistical framework that explicitly formalizes the assumptions under which self-bias can be identified and estimated. Our method models the difference between the scoring distribution that an LLM-as-a-judge assigns to its own completions and the one it assigns to other models' completions, while accounting for the underlying quality of the completions as assessed by an independent third-party judge (e.g., humans). Our method reliably isolates and quantifies self-bias, even when models vary in ability, ensuring that genuine performance differences are not mistaken for self-bias. We conduct an empirical analysis of self-bias on a large dataset (>5000 prompt-completion pairs) consisting of expert human annotations and judgments from nine different LLM judges. We find that some models, such as GPT-4o and Claude 3.5 Sonnet, systematically assign higher scores to their own outputs. These models also display family bias: they systematically assign higher ratings to outputs produced by other models of the same family. Our findings highlight potential pitfalls of using LLM judges and offer practical guidance for mitigating biases when interpreting automated evaluations.
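The paper's estimator is specified in the text itself; as a loose illustration of the general idea (comparing a judge's scores on its own versus others' completions while adjusting for third-party quality ratings), here is a minimal regression-style sketch in Python. The function name, data layout, and toy data are hypothetical, not the authors' method.

```python
import numpy as np

def self_bias_estimate(judge_scores, human_scores, is_self):
    """Estimate self-bias as the coefficient on an 'own output' indicator
    after adjusting for independent human quality ratings via OLS.

    judge_scores: the LLM judge's ratings of completions
    human_scores: third-party human ratings of the same completions
    is_self:      1 if the completion was produced by the judge model itself
    """
    X = np.column_stack([
        np.ones(len(human_scores)),              # intercept
        np.asarray(human_scores, dtype=float),   # quality adjustment
        np.asarray(is_self, dtype=float),        # self indicator
    ])
    y = np.asarray(judge_scores, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Positive coefficient: the judge rates its own outputs above what
    # their human-assessed quality predicts.
    return beta[2]

# Hypothetical toy data: a judge that inflates its own completions by +0.5.
rng = np.random.default_rng(0)
quality = rng.uniform(1, 5, 1000)
own = rng.integers(0, 2, 1000)
judge = quality + 0.5 * own + rng.normal(0, 0.3, 1000)
print(self_bias_estimate(judge, quality, own))  # approximately 0.5
```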
Aligned Textual Scoring Rules
Lu, Yuxuan, Wu, Yifan, Hartline, Jason, Curry, Michael J.
Scoring rules elicit probabilistic predictions from a strategic agent by scoring the prediction against a ground truth state. A scoring rule is proper if, from the agent's perspective, reporting the true belief maximizes the expected score. Building on the development of language models, Wu and Hartline (2024) propose a reduction from textual information elicitation to the numerical (i.e., probabilistic) information elicitation problem, which achieves provable properness for textual elicitation. However, not all proper scoring rules are well aligned with human preferences over text. Our paper designs the Aligned Scoring Rule (ASR) for text by minimizing the mean squared error between a proper scoring rule and a reference score (e.g., a human score). Our experiments show that ASR outperforms previous methods in aligning with human preferences while maintaining properness.
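One simple family of alignment maps that provably preserves properness is the positive affine transform: rescaling by a positive slope leaves the expected-score maximizer unchanged. The sketch below fits such a map by least squares against reference scores; it is an illustrative baseline under that assumption, not the paper's ASR construction.

```python
import numpy as np

def fit_affine_alignment(proper_scores, human_scores):
    """Fit s -> a*s + b with a >= 0 by minimizing mean squared error.

    A positive affine transform of a proper scoring rule is still proper,
    since it preserves the expected-score ordering over reports. Clipping
    a to 0 degenerates to a constant (only weakly proper) rule.
    """
    s = np.asarray(proper_scores, dtype=float)
    h = np.asarray(human_scores, dtype=float)
    cov = np.mean((s - s.mean()) * (h - h.mean()))
    var = s.var()
    a = max(cov / var, 0.0) if var > 0 else 0.0
    b = h.mean() - a * s.mean()
    return a, b

# Hypothetical usage: align Brier-style scores to 1-10 human ratings.
a, b = fit_affine_alignment([0.1, 0.4, 0.9], [2.0, 5.0, 9.5])
aligned_score = a * 0.6 + b
```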
Using Natural Language Explanations to Rescale Human Judgments
Wadhwa, Manya, Chen, Jifan, Li, Junyi Jessy, Durrett, Greg
The rise of large language models (LLMs) has brought a critical need for high-quality human-labeled data, particularly for processes like human feedback and evaluation. A common practice is to label data via consensus annotation over crowdworker judgments. However, annotators' judgments for subjective tasks can differ in many ways: they may have different qualitative judgments about an example, and they may map those to a labeling scheme in different ways. We show that these nuances can be captured by natural language explanations, and propose a method to rescale ordinal annotations and explanations using LLMs. Specifically, we feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. These scores should reflect the annotators' underlying assessments of the example. The rubric can be designed or modified after annotation, and can include distinctions that may not have been known when the original error taxonomy was devised. We explore our technique in the context of rating system outputs for a document-grounded question answering task, where LLMs achieve near-human performance. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
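As a rough sketch of the rescaling step, assuming only a generic `call_llm` function (prompt string in, text out) and a hypothetical rubric, the loop might look like the following. The prompt wording and names are invented for illustration and are not the paper's.

```python
import re

# Hypothetical rubric text; the real rubric is task-specific and may be
# written or revised after annotation.
RUBRIC = """Score the answer from 0-100:
90-100: fully correct, complete, and grounded in the document
50-89:  partially correct or incomplete
0-49:   incorrect or unsupported by the document"""

def rescale(likert_rating, explanation, call_llm):
    """Map an annotator's Likert rating plus free-text explanation to a
    rubric-anchored numeric score via an LLM."""
    prompt = (
        f"Scoring rubric:\n{RUBRIC}\n\n"
        f"An annotator rated an answer {likert_rating}/5 and explained:\n"
        f'"{explanation}"\n\n'
        "Based on the rubric and the explanation, output a single integer score."
    )
    response = call_llm(prompt)
    match = re.search(r"\d+", response)  # parse the first integer in the reply
    return int(match.group()) if match else None
```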
Few-shot Anomaly Detection in Text with Deviation Learning
Das, Anindya Sundar, Ajay, Aravind, Saha, Sriparna, Bhuyan, Monowar
Most current methods for detecting anomalies in text concentrate on constructing models that rely solely on unlabeled data. These models operate on the assumption that no labeled anomalous examples are available, which prevents them from utilizing prior knowledge of anomalies that are typically present in small numbers in many real-world applications. Furthermore, these models prioritize learning feature embeddings rather than optimizing anomaly scores directly, which can lead to suboptimal anomaly scoring and inefficient use of data during the learning process. In this paper, we introduce FATE, a deep few-shot learning-based framework that leverages limited anomaly examples and learns anomaly scores explicitly in an end-to-end manner using deviation learning. In this approach, the anomaly scores of normal examples are adjusted to closely resemble reference scores obtained from a prior distribution, while anomalous samples are forced to have scores that deviate considerably from the reference score, into the upper tail of the prior. Additionally, our model is optimized to learn the distinct behavior of anomalies by utilizing a multi-head self-attention layer and multiple-instance learning approaches. Comprehensive experiments on several benchmark datasets demonstrate that our proposed approach attains a new level of state-of-the-art performance.
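FATE's full architecture (self-attention, multiple-instance learning) is beyond a short snippet, but the core deviation-learning objective described above can be sketched. The following PyTorch code follows the general deviation-network recipe (normal scores pulled toward a Gaussian reference, anomalies pushed past a margin in the upper tail); the margin and prior are illustrative choices, not the paper's settings.

```python
import torch

def deviation_loss(scores, labels, margin=5.0, n_ref=5000):
    """Deviation-learning loss: normal examples (label 0) get scores close
    to reference scores drawn from a N(0, 1) prior; anomalies (label 1)
    are pushed at least `margin` standard deviations into the upper tail.
    """
    ref = torch.randn(n_ref)                           # reference scores from the prior
    dev = (scores - ref.mean()) / (ref.std() + 1e-8)   # standardized deviation
    normal_term = (1 - labels) * dev.abs()             # keep normals near the reference
    anomaly_term = labels * torch.relu(margin - dev)   # push anomalies past the margin
    return (normal_term + anomaly_term).mean()

# Hypothetical usage with raw scores from any text-scoring network:
scores = torch.tensor([0.1, -0.3, 6.2])
labels = torch.tensor([0.0, 0.0, 1.0])
loss = deviation_loss(scores, labels)
```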
Multiple output samples for each input in a single-output Gaussian process
Wong, Jeremy H. M., Zhang, Huayun, Chen, Nancy F.
The standard Gaussian Process (GP) considers only a single output sample per input in the training set. Datasets for subjective tasks, such as spoken language assessment, may be annotated with output labels from multiple human raters per input. This paper proposes to generalise the GP to allow for these multiple output samples in the training set, and thus make use of the available output uncertainty information. This differs from a multi-output GP, as all output samples here are from the same task. The output density function is formulated as the joint likelihood of observing all output samples, and latent variables are not repeated, to reduce computation cost. The test set predictions are inferred as in a standard GP, the difference being in the optimised hyper-parameters. This is evaluated on speechocean762, showing that it allows the GP to compute a test set output distribution that is more similar to the collection of reference outputs from the multiple human raters.
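Under i.i.d. Gaussian noise, the per-input sample mean is a sufficient statistic, which gives one standard way to write the joint likelihood of all rater outputs without repeating latent variables. The numpy sketch below implements that identity with an assumed RBF kernel; it illustrates the idea rather than reproducing the paper's exact formulation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel over the unique training inputs."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def log_marginal_multi(X, Y_groups, noise_var=0.1):
    """Log marginal likelihood of a single-output GP where each input x_i
    has several observed outputs (e.g., scores from multiple human raters).
    With shared latent f_i and Gaussian noise, the exact factorization is
      log p(Y) = log N(ybar; 0, K + diag(noise/n_i)) + within-group terms,
    so f_i never has to be duplicated across raters.
    """
    n = np.array([len(g) for g in Y_groups], dtype=float)   # raters per input
    ybar = np.array([np.mean(g) for g in Y_groups])         # per-input means
    ss = np.array([np.sum((np.asarray(g) - m) ** 2)
                   for g, m in zip(Y_groups, ybar)])        # within-group scatter
    K = rbf(X) + np.diag(noise_var / n)
    top = multivariate_normal.logpdf(ybar, mean=np.zeros(len(n)), cov=K)
    within = (-(n - 1) / 2 * np.log(2 * np.pi * noise_var)
              - 0.5 * np.log(n) - ss / (2 * noise_var))
    return top + within.sum()

# Hypothetical usage: 3 inputs with variable numbers of rater scores each.
X = np.array([[0.0], [1.0], [2.0]])
Y = [[1.1, 0.9, 1.0], [0.2, 0.4], [-1.0, -0.8, -1.2, -0.9]]
print(log_marginal_multi(X, Y))
```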
Analyzing Ordinal Data in SAS using the Multinomial Distribution.
Below, I will use a dataset containing the diarrhea scores of pigs to show how to analyze ordinal data.
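The original post works in SAS; as a rough Python analogue, statsmodels' OrderedModel fits an ordinal (proportional-odds) regression, modeling the multinomial probabilities of each score category via cumulative logits. The pig data below is simulated stand-in data, since the actual dataset is not reproduced here.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical stand-in for the pig data: diarrhea score 0-3 per animal,
# with treatment group as the predictor.
rng = np.random.default_rng(1)
df = pd.DataFrame({"treatment": rng.integers(0, 2, 200)})  # 0=control, 1=treated
df["score"] = pd.Categorical(
    np.clip(rng.poisson(1.5 - 0.7 * df["treatment"]), 0, 3),  # treated score lower
    categories=[0, 1, 2, 3], ordered=True,
)

# Ordinal regression over the multinomial category probabilities.
model = OrderedModel(df["score"], df[["treatment"]], distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```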
IITP at MEDIQA 2019: Systems Report for Natural Language Inference, Question Entailment and Question Answering
Bandyopadhyay, Dibyanayan, Gain, Baban, Saikh, Tanik, Ekbal, Asif
This paper presents the experiments carried out as part of our participation in the MEDIQA challenge (Abacha et al., 2019), a shared task. We participated in all three of its tasks: (i) Natural Language Inference (NLI), (ii) Recognizing Question Entailment (RQE), and (iii) their application to medical Question Answering (QA). We submitted runs from multiple deep-learning-based systems for each task: five system results in each of the NLI and RQE tasks, and four system results for the QA task. The systems yield encouraging results in all three tasks, with the highest performance obtained in the NLI, RQE, and QA tasks being 81.8%, 53.2%, and 71.7%, respectively.